Introduction to Python Data Processing and Visualisation

Adam Ruszkowski

29/09/2017

Purpose

  • In this presentation, we'll briefly cover some of the methods that can be used to process and visualise data in Python
  • This is intended to be a brief overview only - code examples are included, but will not be explained in great detail

Why Python?

  • Python allows us to quickly and reproducibly process data
  • The syntax of the language is (relatively!!) easy to understand
  • Numerous modules exist which take out some of the hard work

Modules

  • Python allows users to group together defined objects, variables and functions into a file (or several files) so that they can be used in other programs
  • These are referred to as "modules"
  • We will use several specialised modules which have to be installed separately by the user:
    • numpy
    • matplotlib/seaborn
    • plotly
  • We have to import these into our program
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
sns.set_context('poster')
import plotly
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True)

Data Processing

Pandas

  • Pandas is a Python module which describes itself as:

    an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

  • It processes data quickly, and has good interactions with Excel and CSV files
  • Generally, data is imported into a DataFrame, which is a bit like an Excel worksheet - it uses columns and row to store data in a tabular format
  • Mostly we refer to columns by their names - note that using the correct case is critical (sitename is not the same as SiteName)

Reading in Data

  • Reading in data from a CSV file is very straightforward
  • We could read in all of the locations at once, but for now we'll pick just one
In [2]:
original_data = pd.read_csv('VBV_Data\VBV_20170131_JTC00567_shay_2940.csv', low_memory=False)
original_data.head(15)
Out[2]:
RecordedDataTime SiteName SpeedMPH Class Length(Metres) Chasis Ht Info Gap Lane Direction
0 2016-06-01 00:01:04 JTC00567 68 2 4 NaN Car / Small Van <5.2 metres NaN 1 S
1 2016-06-01 00:01:09 JTC00567 64 2 4 NaN Car / Small Van <5.2 metres 49.0 1 S
2 2016-06-01 00:01:12 JTC00567 57 5 18 NaN Rigid HGV + Trailer / Standard Articulated (Ca... NaN 1 N
3 2016-06-01 00:01:25 JTC00567 76 4 6 NaN Rigid HGV / Large Van 158.0 1 S
4 2016-06-01 00:02:30 JTC00567 58 4 8 NaN Rigid HGV / Large Van NaN 2 N
5 2016-06-01 00:02:30 JTC00567 57 5 16 NaN Rigid HGV + Trailer / Standard Articulated (Ca... 774.0 1 N
6 2016-06-01 00:03:05 JTC00567 53 5 15 NaN Rigid HGV + Trailer / Standard Articulated (Ca... 1002.0 1 S
7 2016-06-01 00:04:41 JTC00567 58 5 18 NaN Rigid HGV + Trailer / Standard Articulated (Ca... 950.0 1 S
8 2016-06-01 00:04:44 JTC00567 77 2 4 NaN Car / Small Van <5.2 metres 1335.0 1 N
9 2016-06-01 00:04:53 JTC00567 66 2 3 NaN Car / Small Van <5.2 metres 110.0 1 S
10 2016-06-01 00:04:58 JTC00567 57 5 16 NaN Rigid HGV + Trailer / Standard Articulated (Ca... 138.0 1 N
11 2016-06-01 00:05:08 JTC00567 51 5 16 NaN Rigid HGV + Trailer / Standard Articulated (Ca... 146.0 1 S
12 2016-06-01 00:05:10 JTC00567 52 2 4 NaN Car / Small Van <5.2 metres 21.0 1 S
13 2016-06-01 00:05:50 JTC00567 61 2 4 NaN Car / Small Van <5.2 metres 507.0 1 N
14 2016-06-01 00:06:04 JTC00567 48 5 17 NaN Rigid HGV + Trailer / Standard Articulated (Ca... 141.0 1 N

Processing Data

  • Once we have the data imported into pandas, we can start to manipulate it.
  • Python (and pandas) can manipulate dates and times, if they're the correct kind of object.
  • We can convert the strings, and calculate hourly counts in a few lines of code
In [3]:
# Create a "Count" column, which will be 1 for each record to begin wth
original_data['Count'] = 1

# Make sure dates are used as dates, and set them as the index so we can summarise
original_data['DateTime'] = pd.to_datetime(original_data['RecordedDataTime'])
original_data.set_index('DateTime', inplace=True)

# Work out the vehicle count per class and direction for each hour
hourly_counts = original_data.groupby('Info') \
                             .resample('1H', level=0)['Count'] \
                             .count() \
                             .reset_index() \
                             .fillna(0)
hourly_counts
Out[3]:
Info DateTime Count
0 Bus / Coach 2016-06-01 04:00:00 1
1 Bus / Coach 2016-06-01 05:00:00 6
2 Bus / Coach 2016-06-01 06:00:00 12
3 Bus / Coach 2016-06-01 07:00:00 14
4 Bus / Coach 2016-06-01 08:00:00 24
5 Bus / Coach 2016-06-01 09:00:00 14
6 Bus / Coach 2016-06-01 10:00:00 10
7 Bus / Coach 2016-06-01 11:00:00 10
8 Bus / Coach 2016-06-01 12:00:00 8
9 Bus / Coach 2016-06-01 13:00:00 19
10 Bus / Coach 2016-06-01 14:00:00 19
11 Bus / Coach 2016-06-01 15:00:00 20
12 Bus / Coach 2016-06-01 16:00:00 12
13 Bus / Coach 2016-06-01 17:00:00 11
14 Bus / Coach 2016-06-01 18:00:00 5
15 Bus / Coach 2016-06-01 19:00:00 2
16 Bus / Coach 2016-06-01 20:00:00 1
17 Bus / Coach 2016-06-01 21:00:00 0
18 Bus / Coach 2016-06-01 22:00:00 0
19 Bus / Coach 2016-06-01 23:00:00 1
20 Bus / Coach 2016-06-02 00:00:00 1
21 Bus / Coach 2016-06-02 01:00:00 0
22 Bus / Coach 2016-06-02 02:00:00 0
23 Bus / Coach 2016-06-02 03:00:00 0
24 Bus / Coach 2016-06-02 04:00:00 0
25 Bus / Coach 2016-06-02 05:00:00 5
26 Bus / Coach 2016-06-02 06:00:00 11
27 Bus / Coach 2016-06-02 07:00:00 20
28 Bus / Coach 2016-06-02 08:00:00 16
29 Bus / Coach 2016-06-02 09:00:00 12
... ... ... ...
1114 Unclassified 2016-06-06 09:00:00 0
1115 Unclassified 2016-06-06 10:00:00 0
1116 Unclassified 2016-06-06 11:00:00 0
1117 Unclassified 2016-06-06 12:00:00 0
1118 Unclassified 2016-06-06 13:00:00 0
1119 Unclassified 2016-06-06 14:00:00 2
1120 Unclassified 2016-06-06 15:00:00 0
1121 Unclassified 2016-06-06 16:00:00 0
1122 Unclassified 2016-06-06 17:00:00 0
1123 Unclassified 2016-06-06 18:00:00 0
1124 Unclassified 2016-06-06 19:00:00 0
1125 Unclassified 2016-06-06 20:00:00 0
1126 Unclassified 2016-06-06 21:00:00 0
1127 Unclassified 2016-06-06 22:00:00 0
1128 Unclassified 2016-06-06 23:00:00 0
1129 Unclassified 2016-06-07 00:00:00 0
1130 Unclassified 2016-06-07 01:00:00 0
1131 Unclassified 2016-06-07 02:00:00 0
1132 Unclassified 2016-06-07 03:00:00 0
1133 Unclassified 2016-06-07 04:00:00 0
1134 Unclassified 2016-06-07 05:00:00 0
1135 Unclassified 2016-06-07 06:00:00 0
1136 Unclassified 2016-06-07 07:00:00 1
1137 Unclassified 2016-06-07 08:00:00 0
1138 Unclassified 2016-06-07 09:00:00 0
1139 Unclassified 2016-06-07 10:00:00 0
1140 Unclassified 2016-06-07 11:00:00 0
1141 Unclassified 2016-06-07 12:00:00 0
1142 Unclassified 2016-06-07 13:00:00 0
1143 Unclassified 2016-06-07 14:00:00 1

1144 rows × 3 columns

  • We can add as many different columns as we need when grouping together our data:
In [4]:
# Work out the vehicle count per class and direction for each hour
hourly_counts_d = original_data.groupby(['Info', 'Direction']) \
                               .resample('1H', level=0)['Count'] \
                               .count() \
                               .reset_index() \
                               .fillna(0)
hourly_counts_d.head(15)
Out[4]:
Info Direction DateTime Count
0 Bus / Coach N 2016-06-01 04:00:00 1
1 Bus / Coach N 2016-06-01 05:00:00 5
2 Bus / Coach N 2016-06-01 06:00:00 7
3 Bus / Coach N 2016-06-01 07:00:00 5
4 Bus / Coach N 2016-06-01 08:00:00 18
5 Bus / Coach N 2016-06-01 09:00:00 4
6 Bus / Coach N 2016-06-01 10:00:00 7
7 Bus / Coach N 2016-06-01 11:00:00 6
8 Bus / Coach N 2016-06-01 12:00:00 4
9 Bus / Coach N 2016-06-01 13:00:00 12
10 Bus / Coach N 2016-06-01 14:00:00 11
11 Bus / Coach N 2016-06-01 15:00:00 11
12 Bus / Coach N 2016-06-01 16:00:00 5
13 Bus / Coach N 2016-06-01 17:00:00 6
14 Bus / Coach N 2016-06-01 18:00:00 1

Exporting Data

  • Now that we have our data processed to provide the summaries we are interested in, we can export this summary back to a CSV as easily as we read the original data in
In [5]:
hourly_counts.to_csv('JTC00567 Hourly Summaries.csv', index=False)
hourly_counts_d.to_csv('JTC00567 Directional Hourly Summaries.csv', index=False)
  • We can also export any number of DataFrames to Excel
In [6]:
# Create a "writer"
excel_writer = pd.ExcelWriter('JTC00567.xlsx')

# Write each DataFrame to the writer separately, specifying the name we want to use for the worksheet
original_data.to_excel(excel_writer, 'Raw Counts', index=False)
hourly_counts.to_excel(excel_writer, 'Hourly Counts', index=False)
hourly_counts_d.to_excel(excel_writer, 'Hourly Counts_Directional', index=False)

# Save the "writer"
excel_writer.save()

Data Visualisation - matplotlib/seaborn

  • matplotlib is a plotting library for Python, allowing for the production of a wide range of graphs
  • seaborn extends matplotlib by adding support for further types of graphs, improved interaction with pandas dataframes and generally by making matplotlib graphs prettier
  • By bringing seaborn into our project, we use its colour schemes even when we plot graphs with matplotlib (which in some cases is easier)
  • Any of the graphs shown here can be saved as an image file, allowing them to be used in reports or similar

Line Charts

  • matplotlib can plot line charts from our data, and has a special interface for dealing with dates
  • Note that we don't have to specify the different classes - it loops over all of them automatically when we use groupby
In [7]:
# Set up the graph size
fig, ax = plt.subplots(figsize=(15, 5))

# Get the data for each class in the DataFrame
for vehicle_class, data in hourly_counts.groupby('Info'):
    # Plot the data for that class
    ax.plot_date(data['DateTime'], data['Count'], label=vehicle_class, linestyle='-', marker='')

# Set up legend
plt.legend(frameon=True, facecolor='white', 
           loc='lower center', ncol=2, bbox_to_anchor=(0.5, -0.6))

# Set up axis labels
plt.xlabel('Date')
plt.ylabel('Vehicles')

# Fix margins properly
plt.ylim(bottom=0)
ax.margins(x=0)
plt.title('Total Flow - JTC00567')
Out[7]:
<matplotlib.text.Text at 0x22b889ea1d0>
  • Suppose we want to plot both directions separately - we can do this using subplots
In [8]:
# Set up the subplots
fig, all_axes = plt.subplots(2, figsize=(15, 10), sharex=True, sharey=True)

# Get the data for each class in the DataFrame
for i, (direction, directional_data) in enumerate(hourly_counts_d.groupby('Direction')):
    # Group by direction
    for vehicle_class, data in directional_data.groupby('Info'):
        # Plot the data for that class and direction
        all_axes[i].plot_date(data['DateTime'], data['Count'], label=vehicle_class, linestyle='-', marker='')
        all_axes[i].margins(x=0)
        all_axes[i].set_title(direction)
        all_axes[i].set_ylabel('Vehicles')

# Set up legend
plt.legend(frameon=True, facecolor='white', 
           loc='lower center', ncol=2, bbox_to_anchor=(0.5, -0.7))

# Set up labels
plt.xlabel('Date')
fig.suptitle('Flow by Direction - JTC00567')

# Fix margins properly
plt.ylim(bottom=0)
Out[8]:
(0, 1582.3499999999999)

Histograms

  • Suppose we want to plot a histogram of the car speeds
  • pandas includes a method that lets a DataFrame be exported as a histogram
In [22]:
# Set up the graph size
fig, ax = plt.subplots(figsize=(15, 5))

# Filter our original DataFrame to only contain cars (class 2)
cars = original_data[original_data['Info'] == 'Car / Small Van <5.2 metres']

# Set up the bins we want to use in our histogram
bin_size = 10
bins = range(0, max(cars['SpeedMPH']) + bin_size, bin_size)

# Plot the histogram
cars.hist(column='SpeedMPH', bins=bins, ax=ax, alpha=0.5)

# Set up labels and title
plt.xlabel('Speed (mph)')
plt.ylabel('Number of Vehicles')
plt.title('Car/Light Van Speeds Histogram')

# Fix the margins
ax.margins(x=0)

Heatmaps

  • One of the advantages of seaborn over matplotlib is the ease with which heatmaps can be made
  • We can simply provide the rows, columns and value we want to represent
  • Suppose we want to create a heatmap of our car flows - each row representing a day and each column an hour
  • We need to adjust our data into the correct format - first we'll filter to just cars and add in some nice columns for the date and time
In [25]:
# Filter to Cars and vans, Northbound
nb_cars = hourly_counts_d[(hourly_counts_d['Info'] == 'Car / Small Van <5.2 metres')  &
                          (hourly_counts_d['Direction'] == 'N')].copy()

# Add columns for day and hour
nb_cars['Day'] = nb_cars['DateTime'].dt.weekday_name
nb_cars['Hour'] = nb_cars['DateTime'].dt.time
nb_cars.head(15)
Out[25]:
Info Direction DateTime Count Day Hour
321 Car / Small Van <5.2 metres N 2016-06-01 00:00:00 26 Wednesday 00:00:00
322 Car / Small Van <5.2 metres N 2016-06-01 01:00:00 19 Wednesday 01:00:00
323 Car / Small Van <5.2 metres N 2016-06-01 02:00:00 19 Wednesday 02:00:00
324 Car / Small Van <5.2 metres N 2016-06-01 03:00:00 38 Wednesday 03:00:00
325 Car / Small Van <5.2 metres N 2016-06-01 04:00:00 79 Wednesday 04:00:00
326 Car / Small Van <5.2 metres N 2016-06-01 05:00:00 256 Wednesday 05:00:00
327 Car / Small Van <5.2 metres N 2016-06-01 06:00:00 561 Wednesday 06:00:00
328 Car / Small Van <5.2 metres N 2016-06-01 07:00:00 696 Wednesday 07:00:00
329 Car / Small Van <5.2 metres N 2016-06-01 08:00:00 688 Wednesday 08:00:00
330 Car / Small Van <5.2 metres N 2016-06-01 09:00:00 767 Wednesday 09:00:00
331 Car / Small Van <5.2 metres N 2016-06-01 10:00:00 900 Wednesday 10:00:00
332 Car / Small Van <5.2 metres N 2016-06-01 11:00:00 818 Wednesday 11:00:00
333 Car / Small Van <5.2 metres N 2016-06-01 12:00:00 811 Wednesday 12:00:00
334 Car / Small Van <5.2 metres N 2016-06-01 13:00:00 852 Wednesday 13:00:00
335 Car / Small Van <5.2 metres N 2016-06-01 14:00:00 874 Wednesday 14:00:00
  • Next, we'll pivot this toget the days as rows and hours as columns.
In [23]:
# The week isn't named alphabetically! Track the order we want
column_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Create a pivot table
pivot = nb_cars.pivot_table(index='Day', columns='Hour', values='Count')

# Sort our rows according to the days of the week
pivot = pivot.reindex_axis(column_order, axis=0)
pivot
Out[23]:
Hour 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 ... 14:00:00 15:00:00 16:00:00 17:00:00 18:00:00 19:00:00 20:00:00 21:00:00 22:00:00 23:00:00
Day
Sunday 74 39 30 32 41 101 215 325 453 701 ... 830 838 839 615 524 392 305 154 85 80
Monday 40 16 22 46 176 382 543 716 697 868 ... 924 1073 1173 667 523 363 291 151 81 48
Tuesday 22 19 20 23 82 305 515 631 649 709 ... 828 1001 1150 690 516 408 290 177 67 69
Wednesday 26 19 19 38 79 256 561 696 688 767 ... 874 1062 1130 797 561 397 309 174 114 124
Thursday 62 26 15 37 83 292 536 648 715 818 ... 927 1093 1168 861 635 482 313 212 102 43
Friday 20 23 24 33 77 242 475 635 631 839 ... 1226 1320 1507 1278 852 542 337 253 146 88
Saturday 48 29 36 37 85 186 351 531 756 1163 ... 970 914 775 613 461 320 218 202 151 108

7 rows × 24 columns

  • From this data, creating a heatmap is very simple
In [12]:
# Set up the graph size
fig, ax = plt.subplots(figsize=(15, 5))

# Create the heatmap
sns.heatmap(pivot)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x22b7a2887b8>
  • If we choose, we can annotate this heatmap with the counts
In [27]:
# Set up the graph size
fig, ax = plt.subplots(figsize=(15, 5))

# Create the heatmap
sns.heatmap(pivot, annot=True, fmt='d', annot_kws={'size': 12})
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x22b180ccdd8>

Box and Whisker Plots

  • Box and whisker plots are good for showing the distribution of data
  • For this example, let's collect the data for every site observed
  • As different sites use different classifications, we'll filter to car data only by using a partial match on the word "Car"
In [14]:
import os

# Read in data for every site
all_data = pd.concat([pd.read_csv(os.path.join('VBV_Data', f), low_memory=False)
                      for f in os.listdir('VBV_Data')
                      if f.startswith('VBV')])

# Filter to just the rows corresponding to Cars
all_cars = all_data[all_data['Info'].str.contains('Car')]
  • Next, we again use seaborn directly to create a box plot
In [15]:
# Set up the graph size
fig, ax = plt.subplots(figsize=(15, 8))

# Create the box and whisker plot
box_plot = sns.boxplot(data=all_cars, x='SiteName', y='SpeedMPH')

# Rotate the x-axis labels
for lbl in box_plot.get_xticklabels():
    lbl.set_rotation(90)

# Fix the margins
ax.margins(y=0)

Data Visualisation - plotly

  • plotly is a different data visualisation library that allows for the creation of interactive graphs
  • This can be useful for clients who enjoy being able to interrogate the underlying data themselves
  • However, they need to be provided as HTML files, which may unfamiliar for some clients, and can be slow to load if a lot of data is involved in the plot.

Line Charts

  • Standard line charts are similar to the process followed in matplotlib
In [29]:
# Set up an empty list to store each of the lines in
lines = []
for vehicle_class, data in hourly_counts.groupby('Info'):
    # Set up the line plot for each vehicle type, and append to the list
    line = go.Scatter(x=data['DateTime'], y=data['Count'], name=vehicle_class)
    lines.append(line)

# Set up layout
layout = go.Layout(legend=dict(orientation='h',
                               xanchor='center',
                               y=-0.2,
                               x=0.5),
                   xaxis=dict(title='Date'),
                   yaxis=dict(title='Vehicle Count'),
                   title='Total Flow - JTC00567',
                   autosize=False,
                   width=1000,
                   height=600)

# Combine lines and layout, and plot
figure = go.Figure(data=lines, layout=layout)
plotly.offline.iplot(figure)
  • Showing the two directions separately side-by-side with this much interactivity might be a bit of an information overload
  • We can reduce this slightly by using a drop-down menu to let the user choose the direction they want to view.
In [30]:
# Set up empty lists
lines = []
type_counts = []
for direction, directional_data in hourly_counts_d.groupby('Direction'):
    # Initialise counter
    i = 0
    for vehicle_class, data in directional_data.groupby('Info'):
        # Set up the line plot for each vehicle type, and append to a list
        line = go.Scatter(x=data['DateTime'], y=data['Count'], name=vehicle_class,
                          visible=not len(type_counts))
        lines.append(line)
        # Increment counter
        i += 1
    # Add the number of types for this direction to the list
    type_counts.append(i)

# Set up update menus
updatemenus = list([
    dict(active=0,
         showactive=True,
         buttons=list([   
            dict(label = 'N',
                 method = 'update',
                 args = [{'visible': [True] * type_counts[0] + [False] * type_counts[1]},
                         {'title': 'Flow by Direction - JTC00567 (N)'}]),
            dict(label = 'S',
                 method = 'update',
                 args = [{'visible': [False] * type_counts[0] + [True] * type_counts[1]},
                         {'title': 'Flow by Direction - JTC00567 (S)'}])
         ])
        )
])

        
# Set up layout
layout = go.Layout(legend=dict(orientation='h',
                               xanchor='center',
                               y=-0.2,
                               x=0.5),
                   xaxis=dict(title='Date'),
                   yaxis=dict(title='Vehicle Count'),
                   autosize=False,
                   width=1000,
                   height=600,
                   updatemenus=updatemenus)

# Combine lines and layout, and plot
figure = go.Figure(data=lines, layout=layout)
plotly.offline.iplot(figure)

Histograms

  • Again, the process is fairly similar to matplotlib
In [31]:
# Set up the histogram
hist = go.Histogram(x=cars['SpeedMPH'], opacity=0.5,
                    autobinx=False, xbins=dict(start=0,
                                               end=max(cars['SpeedMPH']),
                                               size=10))

# Set up the layout
layout = go.Layout(xaxis=dict(title='Speed (mph)'),
                   yaxis=dict(title='Number of Vehicles'),
                   title='Car/Light Van Speeds Histogram',
                   autosize=False,
                   width=1000,
                   height=600)

# Combine the histogram and layout
figure = go.Figure(data=[hist], layout=layout)
plotly.offline.iplot(figure)

Heatmaps

  • The heatmap is slightly different - the matrix format we used for matplotlib won't work here.
In [19]:
# We need to sort our data so the y axis is shown in the correct order
column_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
nb_cars['Day'] = pd.Categorical(nb_cars['Day'],
                                categories=column_order,
                                ordered=True)
nb_cars.sort_values(['Day', 'Hour'], inplace=True)

# Generate the heatmap
heat = go.Heatmap(x=nb_cars['Hour'],
                  y=nb_cars['Day'],
                  z=nb_cars['Count'],
                  hoverinfo='z')

# Generate the layout
layout = go.Layout(xaxis=dict(tickangle=-90),
                   yaxis=dict(autorange='reversed'),
                   title='Car Counts Heatmap',
                   autosize=False,
                   width=1000,
                   height=600)

# Combine the data and layout
figure = go.Figure(data=[heat], layout=layout)
plotly.offline.iplot(figure)

Box and Whisker Plots

  • We have to set up the box plots individually but otherwise the process is similar
  • This graph is on the next slide as it may take a while to load
In [20]:
# Set up the individual boxes
boxes = [go.Box(y=data['SpeedMPH'], name=site, boxmean=True)
         for site, data in all_cars.groupby('SiteName')]

# Set up the layout
layout = go.Layout(xaxis=dict(title='Site',
                              tickangle=-90),
                   yaxis=dict(title='Speed (mph)'),
                   title='Car Speeds Box Plot',
                   autosize=False,
                   width=1000,
                   height=600)

# Combine the data and layout
figure = go.Figure(data=boxes, layout=layout)
plotly.offline.iplot(figure)

Project Examples

Scottish Canals

  • The client provided us with counts of non-motorised users on the towpath across the canal network
  • Using pandas, matplotlib and seaborn, we were able to produce a dashboard of graphs for each counter identifying the daily totals, average weekday hourly counts, speeds by class and directional splits
  • With this information, it was possible to quickly identify a number of counters reporting erroneous values

Train2Ride

  • The client was considering the impact of reducing the number of cycle spaces on trains
  • Using historical booking data, we were able to analyse every route provided and calculate the proportion of existing cycle spaces in use
  • With plotly, we provided the client with a series of interactive graphs (one per route), with which they could test the impact of reducing the cycle space capacity using a slider
  • This gave visual information (by colouring red the historic services that would have been over capacity if the specified reduction had been in place) while also providing the exact number of services that would have been affected

Summary

  • This presentation has covered only the basics of data processing and just a few examples of the graphs that can be produced with these libraries
  • Run in a non-interactive form of Python, the approximate 300 lines of code shown here took 62 seconds to run, of which 48 seconds was spent writing to CSVs and the Excel file. The creation of all of the graphs took approximately 14 seconds.
  • Pandas has many further functions, and there are a lot of other libraries to use in data processing, including database libraries such as sqlite3
  • There are many other graphs types covered by matplotlib, seaborn and plotly and also various other data visualisation libraries available such as Bokeh and HoloViews